Omics - Analysis of large-scale biomolecular datasets: Introduction to R

KIMN20 - LTH

Daniel Nilsson
Massimiliano Volpe

2025-10-30

Introduction to R

Course date: 04 November 2025

Last modified: 2025-10-30

Welcome to R Programming! 🐧

This presentation will teach you the fundamentals of R programming for data analysis.

What we’ll cover:

  • R language basics and getting help
  • Working with packages
  • Variables and data types
  • Data structures (vectors, matrices, lists, data frames)
  • Reading data from files
  • Writing functions

What Is R?

R is a programming language and environment designed for statistical computing and graphics.

Key Characteristics:

  • Programming Language: High-level language for data analysis and visualization
  • Programming Platform: Complete environment with interpreter and development tools
  • Open-Source Project: Driven by the R core team and global community
  • Statistical Powerhouse: Specialized for statistical analysis and modeling
  • General-Purpose Tool: Can handle diverse computational tasks
  • Idea-to-Implementation Bridge: Transforms concepts into working solutions

What R is NOT:

  • A replacement for statistical expertise
  • The “best” programming language for every task
  • Always the most elegant or efficient solution

Yet R Excels At:

  • Statistical computing and data analysis
  • High-quality graphics and visualization
  • Rapid prototyping of analytical methods
  • Integration with other scientific tools

Why R in Bioinformatics?

R is one of the most widely used languages for bioinformatics analysis.

Statistical Power:

  • Comprehensive statistics and cutting-edge methods
  • Reproducible, script-based research

Specialized Ecosystem:

  • Bioconductor: 2,000+ packages for genomics, proteomics, sequencing
  • CRAN: 18,000+ general-purpose packages
  • Complete analysis pipelines

Visualization Excellence:

  • Publication-quality graphics (ggplot2)
  • Interactive and specialized bioinformatics plots

Active Community:

  • Regular updates and new developments
  • Integrates with Python, databases, web services

Getting Started with R

Opening R/RStudio

  • Command line: Type R in a terminal
  • RStudio: Download from https://posit.co/download/rstudio-desktop/
  • Web-based: Use Posit Cloud (formerly RStudio Cloud) or Google Colab with an R kernel
  • RStudio Server: https://130.235.8.214/rstudio

Your First R Commands

# This is a comment - it won't execute
print("Hello, R!")
[1] "Hello, R!"

Try running this command in your R console!

R as a Calculator

# Basic arithmetic
2 + 3
[1] 5
10 - 4
[1] 6
5 * 6
[1] 30
20 / 4
[1] 5
2^3
[1] 8

Getting Help

Built-in Help Functions

# Get help for a function
?print
help(print)

# Search for functions
??plot
help.search("plot")

Help for Packages

# Get help for installed packages
help(package = "base")

# Vignettes (detailed tutorials)
vignette()
# vignette("ggplot2")  # Commented out to prevent browser opening
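
When you only remember part of a function name, or just want to see which arguments a function takes, two further helpers that ship with R can be useful. This is a small sketch; the function names searched for here are arbitrary examples.

```r
# List objects on the search path whose names contain "read.csv"
apropos("read.csv")

# Show the arguments of a function without opening its help page
args(paste)
```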

Working with Packages

Installing Packages

# Install from CRAN (main repository)
install.packages("tidyverse")

# Install multiple packages
install.packages(c("dplyr", "ggplot2", "readr"))

Loading Packages

# Load a package
library(tidyverse)

# Load with require() (returns TRUE/FALSE)
require(tidyverse)
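
A common pattern is to install a package only when it is missing and then load it. The sketch below uses "stats" as a stand-in because it ships with every R installation; the variable name pkg is only for illustration.

```r
# Placeholder package name; replace it with the package you actually need
pkg <- "stats"

# requireNamespace() checks availability without attaching the package
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)

# A single function can also be called without attaching its package
stats::median(c(1, 5, 10))   # 5
```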

Package Management

# See loaded packages
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.7.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Stockholm
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.1    tidyselect_1.2.1 
 [5] scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     R6_2.5.1         
 [9] generics_0.1.3    knitr_1.50        munsell_0.5.1     pillar_1.9.0     
[13] tzdb_0.4.0        rlang_1.1.4       utf8_1.2.4        stringi_1.8.4    
[17] xfun_0.53         timechange_0.3.0  cli_3.6.3         withr_3.0.2      
[21] magrittr_2.0.3    digest_0.6.37     grid_4.4.1        hms_1.1.3        
[25] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.1    glue_1.8.0       
[29] fansi_1.0.6       colorspace_2.1-1  rmarkdown_2.30    tools_4.4.1      
[33] pkgconfig_2.0.3   htmltools_0.5.8.1
# Update packages
update.packages()

# Remove packages
remove.packages("package_name")

Bioconductor

Bioconductor is the premier repository for R packages in bioinformatics and computational biology.

Key Features:

  • Specialized Tools: Packages for genomics, proteomics, sequencing analysis
  • Quality Assurance: Rigorous review process ensures high-quality, well-documented packages
  • Integrated Workflows: Complete analysis pipelines from raw data to publication
  • Active Community: Regular updates and new package releases

Installing Bioconductor Packages

# Install Bioconductor manager (one time)
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install packages
BiocManager::install("DESeq2")        # RNA-seq analysis
BiocManager::install("edgeR")         # Differential expression
BiocManager::install("limma")         # Microarray analysis
BiocManager::install("GenomicRanges") # Genomic intervals

Commonly used Bioconductor packages:

  • DESeq2: RNA-seq differential expression analysis
  • edgeR: Alternative for RNA-seq and count data
  • limma: Linear models for microarray data
  • Biostrings: DNA/RNA sequence manipulation
  • GenomicRanges: Working with genomic coordinates

Variables and Assignment

A variable is a named storage location in memory that holds a value or object. Variables allow you to store, reference, and manipulate data throughout your R session.

Creating Variables

# Assignment operators
x <- 5
y = 10
20 -> z

# Print variables
x
[1] 5
print(y)
[1] 10

Note: <- is the preferred assignment operator. = also assigns at the top level, but by convention it is reserved for naming function arguments.

Variable Names

# Valid names
my_variable <- 1
myVariable <- 2
.variable <- 3
variable. <- 4

# Invalid names (would cause errors)
# 1variable <- 5
# my-variable <- 6
# my variable <- 7

Variable Naming Conventions

Rules for valid variable names in R:

Valid Names:

  • Start with a letter or dot (not followed by number)
  • Contain letters, numbers, dots, or underscores
  • Examples: my_var, myVariable, .hidden, var2

Invalid Names:

  • Start with number: 2variable
  • Contain spaces: my variable
  • Use hyphens: my-variable
  • Start with dot + number: .2way
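
One way to check these rules is base R's make.names(), which turns any string into a syntactically valid name. A string that comes back unchanged was already valid. The candidate names below are the examples from this slide.

```r
candidates <- c("my_var", "2variable", "my variable", "my-variable", ".2way")

# make.names() replaces invalid characters with "." and prepends "X" if needed
fixed <- make.names(candidates)

# Side-by-side view: valid is TRUE only where nothing had to be changed
data.frame(candidates, fixed, valid = candidates == fixed)
```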

Reserved Words (cannot use):

  • if, else, repeat, while, function
  • for, in, next, break
  • TRUE, FALSE, NULL, Inf, NaN
  • NA, NA_integer_, NA_real_, NA_complex_, NA_character_

Not reserved, but best avoided as variable names:

  • Built-in functions: c, q, t, C, D, I
  • T, F (predefined shortcuts for TRUE/FALSE that can be overwritten)
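
R only protects true keywords and constants such as TRUE. T, F, and the names of built-in functions are ordinary objects that can be masked. A short sketch of the difference (parse() and eval() are used only so the failing assignment does not stop the script):

```r
# TRUE is reserved: assigning to it raises an error
res <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
inherits(res, "try-error")   # TRUE

# T is just a variable that initially holds TRUE, so it can be overwritten
T <- 0
isTRUE(T)                    # FALSE

# A variable named c can coexist with the function c(), which is confusing
c <- 10
c(c, 2)                      # 10 2: R looks for a function when c is called

# Remove the masking variables so T and c behave normally again
rm(T, c)
```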

Variable Naming Style

Best practices for readable, maintainable code:

Make Names Informative:

  • Use descriptive names: patient_ages instead of x
  • Avoid cryptic abbreviations: genotypes instead of fsjht45jkhsdf4

Choose a Consistent Convention:

  • snake_case: my_variable, patient_data
  • camelCase: myVariable, patientData
  • dot.case: my.variable, patient.data
  • Pick one and stick with it throughout your code

Keep Names Reasonable Length:

  • Short enough: tmp for temporary variables
  • Long enough: gene_expression_matrix not g_e_m
  • Avoid: my.variable.2 → use my.variable2

Common Conventions:

  • i, j, k: Loop counters
  • tmp: Temporary variables
  • df: Data frames
  • n: Counts or lengths
  • idx: Indices

Data Types and Classes

Basic Data Types

# Numeric (double)
num <- 3.14
class(num)
[1] "numeric"
# Integer
int <- 42L # L forces the number to be stored as integer
class(int)
[1] "integer"
# Character (string)
text <- "Hello World"
class(text)
[1] "character"
# Logical (boolean)
logic <- TRUE
class(logic)
[1] "logical"

Type Checking

# Check data types
is.numeric(num)
[1] TRUE
is.integer(int)
[1] TRUE
is.character(text)
[1] TRUE
is.logical(logic)
[1] TRUE
# Check if it's a specific type
is.double(num)
[1] TRUE
is.integer(text)
[1] FALSE
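
class() and typeof() can give different answers for the same value: class() reports how R treats the object, while typeof() reports how it is stored. A brief sketch:

```r
class(3.14)       # "numeric"
typeof(3.14)      # "double"

typeof(42L)       # "integer"
is.numeric(42L)   # TRUE: integers also count as numeric

typeof(42)        # "double": without the L suffix a number is a double
is.integer(42)    # FALSE
```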

Type Casting and Conversion

Converting Between Types

# Numeric to character
as.character(3.14)
[1] "3.14"
# Character to numeric
as.numeric("42.5")
[1] 42.5
# Numeric to integer (truncates toward zero, does not round)
as.integer(3.9)
[1] 3
# Logical to numeric
as.numeric(TRUE)  # Returns 1
[1] 1
as.numeric(FALSE) # Returns 0
[1] 0
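
Conversions do not always succeed. A string that is not a number becomes NA (with a warning), and as.integer() drops the decimals rather than rounding. A small sketch with made-up input values:

```r
# "abc" cannot be parsed as a number, so the result is NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
converted <- suppressWarnings(as.numeric(c("42.5", "abc")))
converted            # 42.5   NA
is.na(converted)     # FALSE TRUE

# Truncation versus rounding
as.integer(3.9)      # 3
as.integer(-3.9)     # -3
round(3.9)           # 4
```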

Common Conversions

# String to logical
as.logical("TRUE")
[1] TRUE
as.logical("false")
[1] FALSE
# Factor to character
factor_var <- factor(c("A", "B", "A"))
as.character(factor_var)
[1] "A" "B" "A"
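
A common pitfall is converting a factor that holds numbers directly to numeric: this returns the internal level codes, not the values. Going through character first avoids it. A sketch with an illustrative factor:

```r
f <- factor(c("10", "20", "10"))

levels(f)                     # "10" "20"
as.numeric(f)                 # 1 2 1: the level codes, not the values
as.numeric(as.character(f))   # 10 20 10: the actual values
```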

Basic String Operations

String Basics

# Create strings
name <- "Alice"
greeting <- 'Hello'

# String length
nchar(name)
[1] 5
# Combine strings
paste("Hello", "World")
[1] "Hello World"
paste("Hello", "World", sep = " ")
[1] "Hello World"

String Manipulation

# Substring
substr("Hello World", 1, 5)
[1] "Hello"
# Upper/lower case
toupper("hello")
[1] "HELLO"
tolower("WORLD")
[1] "world"
# Split strings
strsplit("Hello World", " ")
[[1]]
[1] "Hello" "World"
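
The same functions combine well for simple sequence handling. The sketch below uses a made-up DNA string and sample prefix; paste0() is paste() with no separator, and sprintf() fills values into a template.

```r
dna <- "ATGCGT"   # hypothetical sequence

# Split into single characters; strsplit() returns a list, so take element 1
bases <- strsplit(dna, "")[[1]]
bases                                # "A" "T" "G" "C" "G" "T"

# Reverse the bases and join them back into one string
paste(rev(bases), collapse = "")     # "TGCGTA"

# Build labels without a separator
paste0("sample_", 1:3)               # "sample_1" "sample_2" "sample_3"

# Formatted output
sprintf("%s has %d bases", dna, nchar(dna))   # "ATGCGT has 6 bases"
```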

Complex Data Structures

R provides powerful data structures for organizing and manipulating data efficiently.

Using basic data types (numeric, logical, character), we can construct more complex data structures:

Data Structures in R

Overview of R Data Structures:

  • Vectors: One-dimensional arrays of the same data type
  • Matrices: Two-dimensional arrays of the same data type
  • Lists: Ordered collections that can contain different data types
  • Data Frames: Two-dimensional tables (like spreadsheets) with mixed data types

These structures form the foundation for data analysis in bioinformatics!

Vectors

Vectors (or Atomic Vectors) are one-dimensional arrays of the same data type.

Creating Vectors

# Numeric vector
numbers <- c(1, 2, 3, 4, 5)
numbers
[1] 1 2 3 4 5
# Character vector
fruits <- c("apple", "banana", "orange")
fruits
[1] "apple"  "banana" "orange"
# Logical vector
booleans <- c(TRUE, FALSE, TRUE)
booleans
[1]  TRUE FALSE  TRUE

Vector Operations

# Vector arithmetic
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
[1] 5 7 9
x * y
[1]  4 10 18
# Vector functions
length(numbers)
[1] 5
sum(numbers)
[1] 15
mean(numbers)
[1] 3
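
Two behaviours are worth knowing early: a vector holds a single type, so mixed input is silently converted, and a shorter vector is recycled in arithmetic. A sketch with illustrative values:

```r
# Mixed types are coerced to the most general one (here character)
mixed <- c(1, "a", TRUE)
mixed                  # "1" "a" "TRUE"
class(mixed)           # "character"

# Logical values become 1 and 0 when combined with numbers
c(1, TRUE, FALSE)      # 1 1 0

# Recycling: the shorter operand is repeated to match the longer one
values <- c(1, 2, 3, 4)
values * 10            # 10 20 30 40
values + c(0, 100)     # 1 102 3 104
```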

Vector Indexing

# Access elements
fruits[1]       # First element
[1] "apple"
fruits[2:3]     # Second and third
[1] "banana" "orange"
fruits[c(1,3)]  # First and third
[1] "apple"  "orange"
# Negative indexing (exclude)
fruits[-1]      # All except first
[1] "banana" "orange"
fruits[-c(1,3)] # Exclude first and third
[1] "banana"
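
Besides positions, a logical condition can be used as an index, which is how most filtering is done in practice. A sketch with a made-up vector of counts:

```r
counts <- c(12, 0, 7, 30, 0)   # hypothetical read counts

counts > 5             # TRUE FALSE TRUE TRUE FALSE
counts[counts > 5]     # 12 7 30: keep elements where the condition is TRUE
which(counts == 0)     # 2 5: positions of the zeros
sum(counts == 0)       # 2: TRUE counts as 1, so this counts the zeros
```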

Named Vectors

# Create named vector
ages <- c(Alice = 25, Bob = 30, Carol = 35)
ages
Alice   Bob Carol 
   25    30    35 
# Access by name
ages["Alice"]
Alice 
   25 
ages[c("Alice", "Carol")]
Alice Carol 
   25    35 
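
Names can also be read back and combined with logical conditions. A short sketch that reuses the ages vector from above:

```r
ages <- c(Alice = 25, Bob = 30, Carol = 35)

names(ages)              # "Alice" "Bob" "Carol"
ages[["Alice"]]          # 25: double brackets return the value without its name
names(ages)[ages > 28]   # "Bob" "Carol": names of the elements matching a condition
```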

Matrices

Matrices are two-dimensional arrays of the same data type.

Creating Matrices

# Create matrix from vector
matrix(1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
# By row or column
matrix(1:6, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
matrix(1:6, nrow = 2, byrow = FALSE)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrix Operations

# Create sample matrix
m <- matrix(1:4, nrow = 2)
m
     [,1] [,2]
[1,]    1    3
[2,]    2    4
# Matrix dimensions
dim(m)
[1] 2 2
nrow(m)
[1] 2
ncol(m)
[1] 2

Matrix Indexing

# Access elements
m[1, 2]    # Row 1, Column 2
[1] 3
m[1, ]     # Entire row 1
[1] 1 3
m[, 2]     # Entire column 2
[1] 3 4
# Multiple elements
m[1:2, 1]  # Rows 1-2, Column 1
[1] 1 2
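
Selecting a single row or column returns a plain vector by default. Adding drop = FALSE keeps the result as a matrix, and row and column names allow indexing by label. In this sketch the gene and sample labels are hypothetical.

```r
m <- matrix(1:4, nrow = 2)
rownames(m) <- c("gene1", "gene2")       # hypothetical labels
colnames(m) <- c("sampleA", "sampleB")

m["gene1", "sampleB"]        # 3: index by name
m[1, ]                       # a named vector, no longer a matrix
dim(m[1, , drop = FALSE])    # 1 2: still a matrix with one row

# Row-wise summary across samples
rowMeans(m)                  # gene1 = 2, gene2 = 3
```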

Matrix Arithmetic

# Matrix operations
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

m1 + m2
     [,1] [,2]
[1,]    6   10
[2,]    8   12
m1 * m2  # Element-wise
     [,1] [,2]
[1,]    5   21
[2,]   12   32
m1 %*% m2 # Matrix multiplication
     [,1] [,2]
[1,]   23   31
[2,]   34   46

Lists

Lists are ordered collections that can contain different data types.

Creating Lists

# Simple list
my_list <- list(1, "hello", TRUE)
my_list
[[1]]
[1] 1

[[2]]
[1] "hello"

[[3]]
[1] TRUE

# Named list
person <- list(
  name = "Alice",
  age = 25,
  scores = c(85, 90, 88)
)
person
$name
[1] "Alice"

$age
[1] 25

$scores
[1] 85 90 88

Accessing List Elements

# By index
person[[1]]  # First element
[1] "Alice"
person[[3]]  # Third element
[1] 85 90 88
# By name
person$name
[1] "Alice"
person[["age"]]
[1] 25
person$scores
[1] 85 90 88
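
Single and double brackets behave differently on lists: [ ] returns a smaller list, while [[ ]] and $ return the element itself. A sketch using the person list from above:

```r
person <- list(name = "Alice", age = 25, scores = c(85, 90, 88))

class(person["age"])      # "list": single brackets keep the list wrapper
class(person[["age"]])    # "numeric": double brackets extract the element

# Extract an element, then index into it
person[["scores"]][2]     # 90
person$scores[2]          # 90
```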

List Operations

# List length
length(person)
[1] 3
# Names of elements
names(person)
[1] "name"   "age"    "scores"
# Add elements
person$city <- "Stockholm"
person
$name
[1] "Alice"

$age
[1] 25

$scores
[1] 85 90 88

$city
[1] "Stockholm"

# Remove elements
person$city <- NULL
person
$name
[1] "Alice"

$age
[1] 25

$scores
[1] 85 90 88

Data Frames

Data Frames are two-dimensional tables (like spreadsheets) with mixed data types.

Creating Data Frames

# Create data frame
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)
students
   name age grade
1 Alice  20     A
2   Bob  21     B
3 Carol  19     A

Data Frame Properties

# Structure
str(students)
'data.frame':   3 obs. of  3 variables:
 $ name : chr  "Alice" "Bob" "Carol"
 $ age  : num  20 21 19
 $ grade: chr  "A" "B" "A"
# Dimensions
dim(students)
[1] 3 3
nrow(students)
[1] 3
ncol(students)
[1] 3
# Column names
names(students)
[1] "name"  "age"   "grade"

Data Frame Indexing

# Access columns
students
   name age grade
1 Alice  20     A
2   Bob  21     B
3 Carol  19     A
students$name
[1] "Alice" "Bob"   "Carol"
students[["age"]]
[1] 20 21 19
students[, 2]  # Second column
[1] 20 21 19
# Access rows
students[1, ]  # First row
   name age grade
1 Alice  20     A
students[1:2, ] # First two rows
   name age grade
1 Alice  20     A
2   Bob  21     B
students
   name age grade
1 Alice  20     A
2   Bob  21     B
3 Carol  19     A
# Access specific elements
students[1, 2]  # Row 1, Column 2
[1] 20
students[1, "age"] # Row 1, age column
[1] 20
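
Rows are usually selected by a condition rather than by position. The sketch below rebuilds the students data frame from above and shows a few base R ways to filter and sort it.

```r
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19),
  grade = c("A", "B", "A")
)

# Rows where the condition is TRUE, all columns
students[students$age >= 20, ]

# Rows with grade "A", only the name and age columns
students[students$grade == "A", c("name", "age")]

# subset() lets you refer to columns without the students$ prefix
subset(students, age >= 20 & grade == "A")

# Sort rows by age (ascending)
students[order(students$age), ]
```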

Data Frame Operations

# Add column
students$passed <- c(TRUE, TRUE, FALSE)
students
   name age grade passed
1 Alice  20     A   TRUE
2   Bob  21     B   TRUE
3 Carol  19     A  FALSE
# Summary statistics
summary(students)
     name                age          grade             passed       
 Length:3           Min.   :19.0   Length:3           Mode :logical  
 Class :character   1st Qu.:19.5   Class :character   FALSE:1        
 Mode  :character   Median :20.0   Mode  :character   TRUE :2        
                    Mean   :20.0                                     
                    3rd Qu.:20.5                                     
                    Max.   :21.0                                     

# View first/last rows
head(students)
   name age grade passed
1 Alice  20     A   TRUE
2   Bob  21     B   TRUE
3 Carol  19     A  FALSE
tail(students)
   name age grade passed
1 Alice  20     A   TRUE
2   Bob  21     B   TRUE
3 Carol  19     A  FALSE

Reading Data

Reading CSV Files

# Read CSV file
data <- read.csv("assets/data.csv")
head(data)
  Gene_Symbol         Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1       GAPDH ENSG00000111640      1250      1180         980        1050
2        ACTB ENSG00000075624       890       920         750         820
3        TP53 ENSG00000141510       450       480         680         720
4         MYC ENSG00000136997       320       350         580         620
5        EGFR ENSG00000146648       180       200         450         480
6       BRCA1 ENSG00000012048       150       170         380         410

Factors are R's data type for categorical data with a fixed set of levels, which can be ordered or unordered. Setting stringsAsFactors = FALSE keeps text columns as character vectors instead of converting them to factors. This has been the default since R 4.0.0, so the argument only needs to be set explicitly on older versions.

# With custom options
data <- read.csv("assets/data.csv", 
                 header = TRUE,
                 sep = ",",
                 stringsAsFactors = FALSE)
head(data)
  Gene_Symbol         Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1       GAPDH ENSG00000111640      1250      1180         980        1050
2        ACTB ENSG00000075624       890       920         750         820
3        TP53 ENSG00000141510       450       480         680         720
4         MYC ENSG00000136997       320       350         580         620
5        EGFR ENSG00000146648       180       200         450         480
6       BRCA1 ENSG00000012048       150       170         380         410
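
To see what stringsAsFactors changes without needing a file on disk, read.csv() can parse a string through its text argument. The sketch below uses the first three genes from the table above; the variable names are only for illustration.

```r
csv_text <- "Gene_Symbol,Control_1,Treatment_1
GAPDH,1250,980
ACTB,890,750
TP53,450,680"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Gene_Symbol)    # "character"
class(as_fct$Gene_Symbol)    # "factor"
levels(as_fct$Gene_Symbol)   # "ACTB" "GAPDH" "TP53": levels are sorted
```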

Reading Excel Files

# Install readxl if needed
# install.packages("readxl")
library(readxl)

# Read Excel file
excel_data <- read_excel("assets/data.xlsx")
head(excel_data)
# A tibble: 6 × 6
  Gene_Symbol Gene_ID         Control_1 Control_2 Treatment_1 Treatment_2
  <chr>       <chr>               <dbl>     <dbl>       <dbl>       <dbl>
1 GAPDH       ENSG00000111640      1250      1180         980        1050
2 ACTB        ENSG00000075624       890       920         750         820
3 TP53        ENSG00000141510       450       480         680         720
4 MYC         ENSG00000136997       320       350         580         620
5 EGFR        ENSG00000146648       180       200         450         480
6 BRCA1       ENSG00000012048       150       170         380         410
# Read specific sheet
sheet_data <- read_excel("assets/data.xlsx", sheet = "Sheet 1")
head(sheet_data)
# A tibble: 6 × 6
  Gene_Symbol Gene_ID         Control_1 Control_2 Treatment_1 Treatment_2
  <chr>       <chr>               <dbl>     <dbl>       <dbl>       <dbl>
1 GAPDH       ENSG00000111640      1250      1180         980        1050
2 ACTB        ENSG00000075624       890       920         750         820
3 TP53        ENSG00000141510       450       480         680         720
4 MYC         ENSG00000136997       320       350         580         620
5 EGFR        ENSG00000146648       180       200         450         480
6 BRCA1       ENSG00000012048       150       170         380         410

Other File Formats

# Read tab-separated file
tsv_data <- read.delim("assets/data.tsv")
head(tsv_data)
  Gene_Symbol         Gene_ID Control_1 Control_2 Treatment_1 Treatment_2
1       GAPDH ENSG00000111640      1250      1180         980        1050
2        ACTB ENSG00000075624       890       920         750         820
3        TP53 ENSG00000141510       450       480         680         720
4         MYC ENSG00000136997       320       350         580         620
5        EGFR ENSG00000146648       180       200         450         480
6       BRCA1 ENSG00000012048       150       170         380         410
# Read from URL
# url_data <- read.csv("https://example.com/data.csv")

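# Write data (row names are written as an extra first column by default)
write.csv(students, "students.csv")
# Write data without the row-name column
write.csv(students, "students.csv", row.names = FALSE)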
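
The effect of row.names is easiest to see by reading the files back. This sketch writes to temporary files (via tempfile()) so nothing is left in your working directory; the file variable names are illustrative.

```r
students <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(20, 21, 19)
)

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(students, with_rn)                         # default: row names included
write.csv(students, without_rn, row.names = FALSE)   # row names omitted

# The default adds an unnamed first column, which read.csv() labels "X"
names(read.csv(with_rn))      # "X" "name" "age"
names(read.csv(without_rn))   # "name" "age"
```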

Functions

Functions are reusable blocks of code that perform specific tasks. They help organize code, reduce repetition, and make programs more modular and maintainable.

R Functions Overview

Why Use Functions?

  • Reusability: Write once, use many times
  • Organization: Break complex tasks into smaller, manageable pieces
  • Maintainability: Easier to debug and update code
  • Abstraction: Hide implementation details

Functions take inputs (arguments), perform operations, and return outputs (results).

Writing Functions

Basic Function Structure

# Simple function
greet <- function(name) {
  # paste0() joins without a separator, so no space appears before "!"
  greeting <- paste0("Hello, ", name, "!")
  return(greeting)
}

# Call the function
greet("Alice")
[1] "Hello, Alice!"

Functions with Multiple Parameters

# Function with multiple parameters
calculate_bmi <- function(weight_kg, height_m) {
  bmi <- weight_kg / (height_m^2)
  return(bmi)
}

# Usage
calculate_bmi(70, 1.75)
[1] 22.85714
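
Because R arithmetic is vectorized, the same function works on whole vectors. It can also help to check inputs early. The variant below is a sketch: the stopifnot() checks are an addition for illustration and are not part of the original function.

```r
calculate_bmi <- function(weight_kg, height_m) {
  # Fail early with an error if inputs are not numeric or height is not positive
  stopifnot(is.numeric(weight_kg), is.numeric(height_m), all(height_m > 0))
  weight_kg / height_m^2   # the last expression is returned
}

# Works element by element on vectors
calculate_bmi(c(70, 82), c(1.75, 1.80))   # 22.85714 25.30864

# Invalid input raises an error instead of a confusing result
bad <- try(calculate_bmi("70", 1.75), silent = TRUE)
inherits(bad, "try-error")                # TRUE
```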

Functions with Default Values

# Function with defaults
power_function <- function(x, power = 2) {
  result <- x^power
  return(result)
}

# Usage
power_function(3)      # Uses default power = 2
[1] 9
power_function(3, 3)   # Uses power = 3
[1] 27
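
Arguments can be matched by position or by name. Naming them makes calls easier to read and lets you pass them in any order. A sketch using the same function:

```r
power_function <- function(x, power = 2) {
  result <- x^power
  return(result)
}

power_function(x = 2, power = 5)   # 32
power_function(power = 5, x = 2)   # 32: order does not matter when named
power_function(5, 2)               # 25: positional, so x = 5 and power = 2
power_function(c(1, 2, 3))         # 1 4 9: works on vectors too
```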

Functions with Conditional Logic

# Function with if-else
grade_letter <- function(score) {
  if (score >= 90) {
    return("A")
  } else if (score >= 80) {
    return("B")
  } else if (score >= 70) {
    return("C")
  } else {
    return("F")
  }
}

# Usage
grade_letter(85)
[1] "B"
grade_letter(92)
[1] "A"
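
Note that if () expects a single TRUE or FALSE, so grade_letter() handles one score at a time. Two possible ways to grade a whole vector are sketched below: applying the function to each element with vapply(), or binning with cut(). The scores vector is made up.

```r
grade_letter <- function(score) {
  if (score >= 90) {
    return("A")
  } else if (score >= 80) {
    return("B")
  } else if (score >= 70) {
    return("C")
  } else {
    return("F")
  }
}

scores <- c(85, 92, 65)   # hypothetical scores

# Option 1: call the function once per element; character(1) declares
# that every call must return exactly one string
by_function <- vapply(scores, grade_letter, character(1))
by_function   # "B" "A" "F"

# Option 2: bin the scores; right = FALSE makes intervals like [80, 90)
by_cut <- cut(scores,
              breaks = c(-Inf, 70, 80, 90, Inf),
              labels = c("F", "C", "B", "A"),
              right = FALSE)
as.character(by_cut)   # "B" "A" "F"
```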

Quiz Time! 🧠

Question 1

Which operator is used for assignment in R?

  • A) =
  • B) <-
  • C) Both A and B

Question 2

What function creates a vector in R?

  • A) vector()
  • B) c()
  • C) make_vector()

Question 3

How do you access the first element of a vector named my_vec?

  • A) my_vec[0]
  • B) my_vec[1]
  • C) my_vec.first

(Answers: 1-C, 2-B, 3-B)

Next Steps

What to Learn Next

  • Data manipulation with dplyr
  • Data visualization with ggplot2
  • Statistical analysis functions
  • Writing R scripts and RMarkdown
  • Advanced data structures
  • Working with dates and times

Resources

Thank You!

You’ve completed the basic R programming tutorial!

Remember:

  • Use ?function_name for help
  • Install packages with install.packages()
  • Load packages with library()
  • Practice regularly with real data

Questions?

Feel free to ask your instructor or classmates!

🐧 Happy R programming! 🐧

Practice Time! 💻

Exercise 1: Variables and Types

  1. Create variables for your name, age, and height
  2. Check their data types with class()
  3. Convert your age to character and height to integer
# Your code here
name <- "Your Name"
age <- 25
height <- 1.75

class(name)
[1] "character"
class(age)
[1] "numeric"
class(height)
[1] "numeric"
as.character(age)
[1] "25"
as.integer(height)
[1] 1

Exercise 2: Vectors and Data Frames

  1. Create a vector of your favorite numbers
  2. Create a data frame with names and scores
  3. Calculate the mean of the scores
# Your code here
numbers <- c(1, 5, 10, 15, 20)

scores_df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  score = c(85, 92, 78)
)

mean(scores_df$score)
[1] 85